Lesson 6

Welcome

Notes:1、了解钻石市场背后的财富历史,并使用EDA的方法,来获得定量认识
2、最终目的是为了建议一个钻石的预测模型,来帮助你判断一个给定的钻石,是很划算,还是在敲竹杠;
3、探索这个数据集,可帮助你了解钻石价格的成因;

线性回归模型

线性回归(Linear regression)是利用称为线性回归方程的最小二乘函数对一个或多个自变量和因变量之间关系进行建模的一种回归分析。

线性回归有很多实际用途。分为以下两大类: 1、如果目标是预测或者映射,线性回归可以用来对观测数据集的和X的值拟合出一个预测模型。当完成这样一个模型以后,对于一个新增的X值,在没有给定与它相配对的y的情况下,可以用这个拟合过的模型预测出一个y值。 2、给定一个变量y和一些变量X1,…,Xp,这些变量有可能与y相关,线性回归分析可以用来量化y与Xj之间相关性的强度,评估出与y不相关的Xj,并识别出哪些Xj的子集包含了关于y的冗余信息。 ***

Scatterplot Review

建立price和carat的散点图,去掉最高的点

data("diamonds")
## Warning in data("diamonds"): data set 'diamonds' not found
library(ggplot2)
ggplot(aes(x = carat, y = price), 
       data = diamonds) + 
  geom_point() + 
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) + 
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing missing values (geom_point).

#ggsave("carat_price.png")

***

Price and Carat Relationship

Response:从上图可以看出价格和克拉数之间的关系是:
1)随着carat增加,price整体呈上升趋势;
2)同一克拉数,价格区间非常大

我们可以向图中增加一条线性的裁切线,方法是使用统计平滑函数 ## Add on Linear Model

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(color = '#F79420', alpha = 1/4) + 
  stat_smooth(method = 'lm') + 
  scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) + 
  scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99)))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).

#ggsave("carat_price_linear.png")

从上图可以看到,趋势线在一些关键区域不会穿过数据的中心。在关系的中心应有点弧度,而且应该向尾部更朝上点倾斜,如果我们尝试用这个来做预测,我们可能会在所显示的现有数据内部和外部错过一些关键区域。 ***

Frances Gerety

Notes:对数据集做一个简单的介绍。包含了diamondsc.info在2008年收集的5万多颗钻石。 研究这个数据还是比较有用的,因为钻石和我们使用的其他产品不一样,你没办法根据型号就知道它的价格。

A diamonds is forever.


ggpairs Function

Notes:这个函数绘制每个变量之间的关系图,我们的目标是了解钻石的价格

# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, 
  lower = list(continuous = wrap("points", shape = I('.'))), 
  upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#ggsave("diamonds_ggpairs.png")

What are some things you notice in the ggpairs output? Response: 1、carat-price 基本上呈现指数级增长; 2、price与x、z的关系比较明显,但与其他变量关系并不明显


The Demand of Diamonds

Notes:

library(gridExtra)

plot1 <- ggplot(aes(x = price), data = diamonds) + 
  geom_histogram(binwidth = 100, color = '#099DD9') + 
  ggtitle('Price')

plot2 <- ggplot(aes(x = price), data = diamonds) + 
  geom_histogram(binwidth = 0.01, color = '#F79420') + 
  scale_x_log10() + 
  ggtitle('Price (log10)')

grid.arrange(plot1, plot2, ncol=2)


Scatterplot Transformation

qplot(carat, price, data = diamonds) + 
  scale_y_continuous(trans = log10_trans()) + 
  ggtitle('Price (log10) by Carat')

#ggsave("carat_price_log10.png")

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).

#ggsave("price_log10_by_cube_root_of_carat.png")

***

Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121

由于有很多点出现重合的现象,我们可以通过抖动点,或者添加透明度来让点变小些

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 1/2, size = 3/4, position = "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).

#ggsave("overplotting_price_carat.png")

***

Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)
library(ggplot2)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).

#ggsave("price_carat_clarity.png")

***

Clarity and Price

Response:从图中可以看出,净度可以说明价格,当carat一致时,净度越高,价格越高。


Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).

#ggsave("price_carat_cut.png")

***

Cut and Price

Response:Cut 不影响价格,因为克拉数相同时,不同切工之间价格相差并不明显,而且大多数钻石都是理想切割。


Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', reverse = FALSE,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).

#ggsave("price_carat_color.png")

***

Color and Price

Response:虽然根据专业分析,颜色肉眼识别不出来,但是颜色对价格还是有影响,从图中可以看出,克拉数一致时,颜色越纯净,价格相对越高。


Linear Models in R

Notes: 在r中,我们可以用lm函数创建模型,我们需要按照yx格式提供一个公式,lm(yx),其中y是结果变量,x是解释变量

Response: 我们对长尾变量进行对数变换,我们估计随着钻石数量增加,无瑕疵钻石将呈现指数规律减少,所以我们应该关注克拉重量的立方根。


Building the Linear Model

Notes:下面我们为价格建立线性模型,查看R中的线性模型和运算符:http://data.princeton.edu/R/linearModels.html

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)  #第一个模型m1
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## =========================================================================
##                      m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------
##   (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415***  
##                   (0.006)    (0.019)    (0.019)    (0.017)    (0.010)    
##   I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144***  
##                   (0.007)    (0.032)    (0.031)    (0.028)    (0.016)    
##   carat                      -1.137***  -1.163***  -0.992***  -1.093***  
##                              (0.012)    (0.011)    (0.010)    (0.006)    
##   cut: .L                                0.224***   0.224***   0.120***  
##                                         (0.004)    (0.004)    (0.002)    
##   cut: .Q                               -0.062***  -0.062***  -0.031***  
##                                         (0.004)    (0.003)    (0.002)    
##   cut: .C                                0.051***   0.052***   0.014***  
##                                         (0.003)    (0.003)    (0.002)    
##   cut: ^4                                0.018***   0.018***  -0.002     
##                                         (0.003)    (0.002)    (0.001)    
##   color: .L                                        -0.373***  -0.441***  
##                                                    (0.003)    (0.002)    
##   color: .Q                                        -0.129***  -0.093***  
##                                                    (0.003)    (0.002)    
##   color: .C                                         0.001     -0.013***  
##                                                    (0.003)    (0.002)    
##   color: ^4                                         0.029***   0.012***  
##                                                    (0.003)    (0.002)    
##   color: ^5                                        -0.016***  -0.003*    
##                                                    (0.003)    (0.001)    
##   color: ^6                                        -0.023***   0.001     
##                                                    (0.002)    (0.001)    
##   clarity: .L                                                  0.907***  
##                                                               (0.003)    
##   clarity: .Q                                                 -0.240***  
##                                                               (0.003)    
##   clarity: .C                                                  0.131***  
##                                                               (0.003)    
##   clarity: ^4                                                 -0.063***  
##                                                               (0.002)    
##   clarity: ^5                                                  0.026***  
##                                                               (0.002)    
##   clarity: ^6                                                 -0.002     
##                                                               (0.002)    
##   clarity: ^7                                                  0.032***  
##                                                               (0.001)    
## -------------------------------------------------------------------------
##   R-squared            0.9        0.9        0.9        1.0        1.0   
##   adj. R-squared       0.9        0.9        0.9        1.0        1.0   
##   sigma                0.3        0.3        0.3        0.2        0.1   
##   F               652012.1   387489.4   138654.5    87959.5   173791.1   
##   p                    0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood   -7962.5    -3631.3    -1837.4     4235.2    34091.3   
##   Deviance          4242.8     3613.4     3380.8     2699.2      892.2   
##   AIC              15931.0     7270.6     3690.8    -8442.5   -68140.5   
##   BIC              15957.7     7306.2     3762.0    -8317.9   -67953.7   
##   N                53940      53940      53940      53940      53940     
## =========================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes:

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response: 1)近几年的钻石价格s:http://www.pricescope.com/diamond-prices/diamond-prices-chart

2)2013年全球钻石报告:http://www.bain.com/publications/articles/global-diamond-report-2013.aspx

3)R中的回归系数:http://www.r-bloggers.com/interpreting-regression-coefficient-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29

4)解释回归系数:http://www.theanalysisfactor.com/interpreting-regression-coefficients/

5)拟合与解释线性模型:http://blog.yhathq.com/posts/r-lm-summary.html

6)线性模型中因子系数的另一种解释:http://stats.stackexchange.com/a/24256


A Bigger, Better Data Set

Notes:

#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))

load("BigDiamonds.rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

names(diamondsbig)
##  [1] "carat"        "cut"          "color"        "clarity"     
##  [5] "table"        "depth"        "cert"         "measurements"
##  [9] "price"        "x"            "y"            "z"
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)  #第一个模型m1
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamondsbig)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamondsbig)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig)
## 
## =========================================================================
##                      m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------
##   (Intercept)      3.096***   1.406***   1.218***   0.405***  -0.663***  
##                   (0.002)    (0.005)    (0.005)    (0.006)    (0.006)    
##   I(carat^(1/3))   5.317***   7.911***   7.920***   8.170***   8.368***  
##                   (0.002)    (0.008)    (0.008)    (0.007)    (0.005)    
##   carat                      -0.767***  -0.779***  -0.782***  -0.815***  
##                              (0.002)    (0.002)    (0.002)    (0.001)    
##   cut: V.Good                            0.119***   0.092***   0.059***  
##                                         (0.002)    (0.002)    (0.001)    
##   cut: Ideal                             0.256***   0.222***   0.130***  
##                                         (0.002)    (0.001)    (0.001)    
##   color: K/L                                        0.134***   0.128***  
##                                                    (0.004)    (0.003)    
##   color: J/L                                        0.302***   0.325***  
##                                                    (0.004)    (0.003)    
##   color: I/L                                        0.422***   0.457***  
##                                                    (0.003)    (0.003)    
##   color: H/L                                        0.517***   0.560***  
##                                                    (0.003)    (0.003)    
##   color: G/L                                        0.627***   0.661***  
##                                                    (0.003)    (0.002)    
##   color: F/L                                        0.723***   0.751***  
##                                                    (0.003)    (0.002)    
##   color: E/L                                        0.790***   0.805***  
##                                                    (0.003)    (0.002)    
##   color: D/L                                        0.894***   0.886***  
##                                                    (0.003)    (0.003)    
##   clarity: I1                                                  0.355***  
##                                                               (0.005)    
##   clarity: SI2                                                 0.684***  
##                                                               (0.005)    
##   clarity: SI1                                                 0.834***  
##                                                               (0.005)    
##   clarity: VS2                                                 0.979***  
##                                                               (0.005)    
##   clarity: VS1                                                 1.067***  
##                                                               (0.005)    
##   clarity: VVS2                                                1.145***  
##                                                               (0.005)    
##   clarity: VVS1                                                1.224***  
##                                                               (0.005)    
##   clarity: IF                                                  1.346***  
##                                                               (0.005)    
## -------------------------------------------------------------------------
##   R-squared             0.9        0.9        0.9        0.9       1.0   
##   adj. R-squared        0.9        0.9        0.9        0.9       1.0   
##   sigma                 0.4        0.4        0.4        0.3       0.2   
##   F               5000339.6  3039162.6  1608418.4   785005.6  903078.3   
##   p                     0.0        0.0        0.0        0.0       0.0   
##   Log-likelihood  -336293.7  -283694.8  -268204.9  -162478.6   23289.3   
##   Deviance         107833.3    90420.2    85850.0    60255.5   32348.9   
##   AIC              672593.4   567397.6   536421.8   324985.2  -46534.6   
##   BIC              672627.3   567442.8   536489.6   325143.4  -46286.0   
##   N                597311     597311     597311     597311    597311     
## =========================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)

exp(modelEstimate)
##        fit     lwr      upr
## 1 4786.053 3033.06 7552.207

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.


Final Thoughts

Notes:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!